Indexing Google 1T for low-turnaround wildcarded frequency queries
Author
Abstract
We propose a technique to prepare the Google 1T n-gram data set for wildcarded frequency queries with very low turnaround time, making unbatched applications possible. Our method supports token-level wildcarding and – given a cache of 3.3 GB of RAM – requires only a single read of less than 4 KB from disk to answer a query. We present an indexing structure, a way to generate it, and suggestions for how it can be tuned to particular applications.

1 Background and motivation

The “Google 1T” data set (LDC #2006T13) is a collection of 2-, 3-, 4-, and 5-gram frequencies extracted at Google from around 10^12 tokens of raw web text. Wide access to web-scale data being a relative novelty, there has been considerable interest in the research community in how this resource can be put to use (Bansal and Klein, 2011; Hawker et al., 2007; Lin et al., 2010, among others). We are concerned with facilitating approaches in which a large number of frequency queries (optionally with token-by-token wildcarding) are made automatically within a larger natural-language system. Our motivating example is Bansal and Klein (2011), who substantially improve statistical parsing by integrating frequency-based features from Google 1T, taken as indicative of associations between words. In that work, however, parser test data is preprocessed “off-line” to make the n-gram queries tractable, which hampers its practical utility. Our technique eliminates such barriers to application, making it feasible to answer previously unseen wildcarded frequency queries “on-line”, i.e. when parsing new inputs. We devise a structure that achieves this, making each query cost approximately a single random disk access, using an in-memory cache of about 3 GB. Our own implementation will be made available to other researchers as open source.
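The abstract describes answering each frequency query with a single small disk read, guided by an in-memory cache. A minimal sketch of that general idea follows; the class name, on-disk format (tab-separated n-gram/count lines grouped into blocks under 4 KB), and cache layout are illustrative assumptions, not the paper's actual structure:

```python
import bisect


class NgramIndex:
    """Illustrative block index: an in-memory sorted list of each disk
    block's first key, plus (offset, length) spans, so that any exact
    lookup costs one seek and one read of a sub-4-KB block."""

    def __init__(self, block_keys, block_spans, data_path):
        self.block_keys = block_keys    # first n-gram of each block, sorted
        self.block_spans = block_spans  # (byte_offset, byte_length) per block
        self.data_path = data_path      # file of "ngram\tfreq" lines

    def lookup(self, key):
        """Return the frequency of `key`, or 0, with one disk read."""
        i = bisect.bisect_right(self.block_keys, key) - 1
        if i < 0:
            return 0
        offset, length = self.block_spans[i]
        with open(self.data_path, "rb") as f:
            f.seek(offset)
            block = f.read(length)      # the single < 4 KB read
        for line in block.decode("utf-8").splitlines():
            ngram, _, freq = line.rpartition("\t")
            if ngram == key:
                return int(freq)
        return 0
```

For token-level wildcards, the same scheme could be replicated over additional sort orders of the n-grams so that all entries matching a wildcard pattern stay contiguous within a block; that extension is omitted from this sketch.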
Similar resources
Introducing Linggle: From Concordance to Linguistic Search Engine
We introduce a Web-scale linguistics search engine, Linggle, that retrieves lexical bundles in response to a given query. Unlike a typical concordance, Linggle accepts queries with keywords, wildcards, wild parts of speech (PoS), synonymous words, and additional regular expression (RE) operators, and returns bundles with frequency counts. In our approach, we augment the Google Web 1T corpus with inv...
Google Web 1T 5-Grams Made Easy (but not for the computer)
This paper introduces Web1T5-Easy, a simple indexing solution that allows interactive searches of the Web 1T 5-gram database and a derived database of quasi-collocations. The latter is validated against co-occurrence data from the BNC and ukWaC on the automatic identification of non-compositional VPC.
Linggle: a Web-scale Linguistic Search Engine for Words in Context
In this paper, we introduce a Web-scale linguistics search engine, Linggle, that retrieves lexical bundles in response to a given query. The query might contain keywords, wildcards, wild parts of speech (PoS), synonyms, and additional regular expression (RE) operators. In our approach, we incorporate inverted file indexing, PoS information from BNC, and semantic indexing based on Latent Dirichl...
Using Lexical Patterns in the Google Web 1T Corpus to Deduce Semantic Relations Between Nouns
This paper investigates methods for using lexical patterns in a corpus to deduce the semantic relation that holds between two nouns in a noun-noun compound phrase such as “flu virus” or “morning exercise”. Much of the previous work in this area has used automated queries to commercial web search engines. In our experiments we use the Google Web 1T corpus. This corpus contains every 2-, 3-, 4- and 5...
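The pattern-based idea this abstract summarizes — collecting the lexical material that links two nouns in corpus n-grams, weighted by frequency — can be sketched roughly as follows; the function name and the (n-gram, frequency) input format are illustrative, not taken from the paper:

```python
from collections import Counter


def pattern_counts(ngrams, noun1, noun2):
    """Count the intervening word sequences that join noun1 ... noun2
    in a list of (ngram, frequency) pairs. Illustrative sketch only."""
    counts = Counter()
    for ngram, freq in ngrams:
        toks = ngram.split()
        # keep n-grams that start with noun1, end with noun2,
        # and have at least one linking token in between
        if len(toks) > 2 and toks[0] == noun1 and toks[-1] == noun2:
            counts[" ".join(toks[1:-1])] += freq
    return counts
```

The resulting pattern distribution (e.g. how often “of” versus “causes” links the pair) is the kind of signal such work feeds into a classifier over semantic relations.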
Investigating the correspondence of users' search queries with papers' suggested terms in the bibliographic records of the Latin databases EBSCO and IEEE
Purpose: This study aims to investigate the correspondence of users' queries with the alternative terms of the Latin databases IEEE and EBSCO. Databases display the subject content of their documents through natural- or controlled-language vocabularies in specified bibliographic fields, along with other bibliographic information; these are called papers' alternative terms. Methodology: We used content an...